Table of Contents

  • 1. Column Information
  • 2. Reading Data
    • 2.1 Creating new column
  • 3. Cleaning Data
    • 3.1 Listing Inconsistent and Missing Data
    • 3.2 Fixing Inconsistent and Missing Data
    • 3.3 Finding Outliers
    • 3.4 Fixing Outliers
  • 4. Model Planning
    • 4.1 Statistical Summary
    • 4.2 Univariate Graphs
      • 4.2.1 Numeric Graphs
      • 4.2.2 Categorical Graphs
    • 4.3 Bivariate Graphs
      • 4.3.1 Numeric-Numeric Graphs
      • 4.3.2 Categorical-Categorical Graphs
    • 4.3 Advanced Plots
    • 4.4 Graph Analysis Summary
  • 5. Model Building
    • 5.1 (Polio and Hepatitis B) vs Diphtheria
    • 5.2 (Schooling and Income Composition of Resources) vs Life Expectancy
    • 5.3 - Summary and Analysis
  • 6. Operationalize
    • 6.1 Methodology
    • 6.2 Issues with implementing methodology
  • 7. Result
    • 7.1 Summary and conclusion
    • 7.2 Future recommendation

Name & IDs of the Group Members¶

   Student Names: Waleed Almutairi                     Student IDs: 202011580
                   Abdulmalik Almadhi                                202026200
                   Abdullah Alomair                                  202032920
                   Muath Alsubhi                                     202027420
                   Mohammed Aljoudi                                  202041460

1. Column Information¶

Column Description
Country Name of the country
Status Developed or Developing status
Life expectancy Life Expectancy in age
Adult Mortality Probability of dying between 15 and 60 years per 1000 population for both sexes
Infant deaths Number of Infant Deaths per 1000 population
Alcohol Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
Percentage expenditure Expenditure on health as a percentage of Gross Domestic Product per capita(%)
Hepatitis B Hepatitis B (HepB) immunization coverage among 1-year-olds(%)
Measles Measles - number of reported cases per 1000 population
BMI Average Body Mass Index of entire population
Under-five deaths Number of under-five deaths per 1000 population
Polio Polio (Pol3) immunization coverage among 1-year-olds(%)
Total expenditure Government expenditure on health as a percentage of total government expenditure(%)
Diphtheria Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds(%)
GDP Gross Domestic Product per capita (in USD)
Population Population of the country
Thinness 1-19 years Prevalence of thinness among children and adolescents for Age 10 to 19(%)
Thinness 5-9 years Prevalence of thinness among children for Age 5 to 9(%)
Income composition of resources Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
Schooling Number of years of Schooling(years)
Continent Continent of each country
In [1]:
# Importing libraries to work with
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from random import randint, random
import numpy as np
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
import numpy as np
from sklearn.preprocessing import StandardScaler

2. Reading Data¶

In [5]:
df = pd.read_csv("..\Life Expectancy Data.csv")

# As the data is very clean, we will create some inconsistencies.
df["Schooling"] = [-x if randint(0,10) == 3 else x for x in df["Schooling"].values]
df["GDP"] = [-x if randint(0,10) == 3 else x for x in df["GDP"].values]
df["under-five deaths "] = [x+random() for x in df["under-five deaths "].values]

inconsistent_datatype = ["Schooling", "under-five deaths"]

fields = {"Fields":[str(x) for x in df.columns], "Types":[str(df[x].dtype) for x in df.columns],
         "Real Type": ["int64" if x in inconsistent_datatype else str(df[x].dtype) for x in df.columns]}
fields_df = pd.DataFrame(data=fields)

display(df.head())
display(fields_df)
Country Year Status Life expectancy Adult Mortality infant deaths Alcohol percentage expenditure Hepatitis B Measles ... Polio Total expenditure Diphtheria HIV/AIDS GDP Population thinness 1-19 years thinness 5-9 years Income composition of resources Schooling
0 Afghanistan 2015 Developing 65.0 263.0 62 0.01 71.279624 65.0 1154 ... 6.0 8.16 65.0 0.1 584.259210 33736494.0 17.2 17.3 0.479 10.1
1 Afghanistan 2014 Developing 59.9 271.0 64 0.01 73.523582 62.0 492 ... 58.0 8.18 62.0 0.1 612.696514 327582.0 17.5 17.5 0.476 10.0
2 Afghanistan 2013 Developing 59.9 268.0 66 0.01 73.219243 64.0 430 ... 62.0 8.13 64.0 0.1 -631.744976 31731688.0 17.7 17.7 0.470 9.9
3 Afghanistan 2012 Developing 59.5 272.0 69 0.01 78.184215 67.0 2787 ... 67.0 8.52 67.0 0.1 669.959000 3696958.0 17.9 18.0 0.463 9.8
4 Afghanistan 2011 Developing 59.2 275.0 71 0.01 7.097109 68.0 3013 ... 68.0 7.87 68.0 0.1 63.537231 2978599.0 18.2 18.2 0.454 9.5

5 rows × 22 columns

Fields Types Real Type
0 Country object object
1 Year int64 int64
2 Status object object
3 Life expectancy float64 float64
4 Adult Mortality float64 float64
5 infant deaths int64 int64
6 Alcohol float64 float64
7 percentage expenditure float64 float64
8 Hepatitis B float64 float64
9 Measles int64 int64
10 BMI float64 float64
11 under-five deaths float64 float64
12 Polio float64 float64
13 Total expenditure float64 float64
14 Diphtheria float64 float64
15 HIV/AIDS float64 float64
16 GDP float64 float64
17 Population float64 float64
18 thinness 1-19 years float64 float64
19 thinness 5-9 years float64 float64
20 Income composition of resources float64 float64
21 Schooling float64 int64

2.1 Creating new column¶

In [6]:
# Adding Extra Categorical Column named Continent

europe = ['Albania', 'Austria', 'Belgium', 'Bulgaria', 'Belarus', 'Germany', 'Denmark', 'Estonia', 'Finland', 
          'Greece', 'Hungary', 'Iceland', 'Italy', 'Lithuania', 'Luxembourg', 'Latvia', 'Malta', 'Norway', 'Poland', 
          'Portugal', 'Romania', 'Sweden', 'Slovenia', 'Slovakia', 'San Marino', 'Bolivia (Plurinational State of)',
          'Ukraine', 'Bosnia and Herzegovina', 'Croatia', 'Monaco', 'Montenegro', 'Serbia', 'Spain', 'Switzerland',
         'Czechia', 'Democratic People\'s Republic of Korea','Netherlands', 'Republic of Moldova', 'The former Yugoslav republic of Macedonia',
         'United Kingdom of Great Britain and Northern Ireland']

 

africa = ['Angola', 'Burkina Faso', 'Burundi', 'Benin', 'Botswana', 'Congo', 'Cameroon', 
          'Djibouti', 'Egypt', 'Eritrea', 'Ethiopia', 'Gabon', 'Ghana', 'Guinea', 'Guinea-Bissau', 'Kenya', 'Liberia',
          'Libya', 'Madagascar', 'Mali', 'Mauritania', 'Mauritius', 'Malawi', 'Mozambique', 'Namibia', 'Niger', 
          'Rwanda', 'Seychelles', 'Sudan', 'Senegal', 'Somalia', 'Togo', 'Tunisia', 'Uganda', 'Zambia', 'Zimbabwe', 
          'Algeria', 'Central African Republic', 'Chad', 'Comoros', 'Equatorial Guinea', 'Morocco', 'South Africa', 
          'Swaziland', 'Cabo Verde', "Côte d'Ivoire", 'Gambia', 'Sao Tome and Principe', 'South Sudan', 'United Republic of Tanzania', 'Democratic Republic of the Congo']

 

asia = ['Afghanistan', 'Armenia', 'Azerbaijan', 'Bangladesh', 'Bahrain', 'Brunei Darussalam', 'Cyprus', 'Georgia', 'Indonesia', 'Israel', 
        'Iraq', 'Jordan', 'Japan', 'Kyrgyzstan', 'Kuwait', 'Lebanon', 'Myanmar', 'Mongolia', 'Maldives', 'Malaysia', 'Oman', 
        'Philippines', 'Qatar', 'Saudi Arabia', 'Singapore', 'Thailand', 'China',
        'Tajikistan', 'Turkmenistan', 'Turkey', 'Uzbekistan', 'Yemen', 'Cambodia', 'Kazakhstan', 'United Arab Emirates', 'Iran (Islamic Republic of)', "Lao People's Democratic Republic", 'Republic of Korea',
       'Russian Federation', 'Syrian Arab Republic', 'Timor-Leste', 'Viet Nam']

 

north_america = ['Antigua and Barbuda', 'Barbados', 'Bahamas', 'Belize', 'Canada', 'Costa Rica', 'Cuba', 'Dominica', 
                 'Dominican Republic', 'Guatemala', 'Haiti', 'Honduras', 'Jamaica', 'Mexico', 'Nicaragua', 'Panama', 
                 'Trinidad and Tobago', 'El Salvador', 'Grenada', 'Saint Kitts and Nevis', 'Saint Lucia', 
                 'Saint Vincent and the Grenadines', 'United States of America']

 

south_america = ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Ecuador', 'Guyana', 'Peru', 'Paraguay', 
                 'Suriname', 'Uruguay' , 'Venezuela (Bolivarian Republic of)']

 

oceania = ['Fiji', 'Kiribati', 'New Zealand', 'Papua New Guinea', 'Solomon Islands', 'Tonga', 'Vanuatu', 'Samoa', 'Cook Islands', 'Micronesia (Federated States of)', 'Niue']
continent = []
for country in df['Country'].values:
    if (country in europe):
        continent.append('Europe')
    elif (country in africa):
        continent.append('Africa')
    elif (country in asia):
        continent.append('Asia')
    elif (country in north_america):
        continent.append('North America')
    elif (country in south_america):
        continent.append('South America')
    elif (country in oceania):
        continent.append('Oceania')
    else:
        continent.append('Unknown')
        
df['Continent'] = continent

3. Cleaning Data¶

3.1 Listing Inconsistent and Missing Data¶

In [7]:
# List of created inconsistent data.
df.info()

inconsistent_data = inconsitent_datatype = ["Schooling", "under-five deaths ", "GDP"]
fields2 = {"Fields":[str(x) for x in df.columns], 
          "Inconsistencies":[True if x in inconsistent_data else False for x in df.columns],
         "Missing Data": [True if df[col].isnull().any() else False for col in df.columns]}

# Create dataframe to display inconsistent and missing data.
fields2_df = pd.DataFrame(data=fields2)
display(fields2_df)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 23 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10   BMI                             2904 non-null   float64
 11  under-five deaths                2938 non-null   float64
 12  Polio                            2919 non-null   float64
 13  Total expenditure                2712 non-null   float64
 14  Diphtheria                       2919 non-null   float64
 15   HIV/AIDS                        2938 non-null   float64
 16  GDP                              2490 non-null   float64
 17  Population                       2286 non-null   float64
 18   thinness  1-19 years            2904 non-null   float64
 19   thinness 5-9 years              2904 non-null   float64
 20  Income composition of resources  2771 non-null   float64
 21  Schooling                        2775 non-null   float64
 22  Continent                        2938 non-null   object 
dtypes: float64(17), int64(3), object(3)
memory usage: 528.0+ KB
Fields Inconsistencies Missing Data
0 Country False False
1 Year False False
2 Status False False
3 Life expectancy False True
4 Adult Mortality False True
5 infant deaths False False
6 Alcohol False True
7 percentage expenditure False False
8 Hepatitis B False True
9 Measles False False
10 BMI False True
11 under-five deaths True False
12 Polio False True
13 Total expenditure False True
14 Diphtheria False True
15 HIV/AIDS False False
16 GDP True True
17 Population False True
18 thinness 1-19 years False True
19 thinness 5-9 years False True
20 Income composition of resources False True
21 Schooling True True
22 Continent False False

3.2 Fixing Inconsistent and Missing Data¶

In [8]:
# List of columns with NaN values.
null_columns=df.columns[df.isna().any()]

# Imputing all columns with NaN values.
for c in null_columns:
    if df[c].dtype!='object':
        value = df[c].mean()
    else:
        value = df[c].mode()
        value = value[0]  
    df[c].fillna(value,inplace=True)

# Fix the inconsistent datatypes of float64 to int64.
df['under-five deaths '] = df['under-five deaths '].apply(lambda x: int(x))

# Fix the inconsistent datatypes with negative values.
df['GDP'] = df['GDP'].apply(lambda x: abs(x))
df['Schooling'] = df['Schooling'].apply(lambda x: abs(x))

3.3 Finding Outliers¶

In [9]:
plt.figure(figsize=(15,10))
sns.boxplot(data=df)
plt.xticks(rotation=45)
plt.show()

numeric_columns = df.select_dtypes(exclude='object').columns.drop('Year')
# From this graph we decided to make the threshold = 2 to limit most outliers.

# Scaled all numeric data.
scaled_values = StandardScaler().fit_transform(df[numeric_columns])

df2=pd.DataFrame(scaled_values,columns=df[numeric_columns].columns)

# Plotting scaled and non-scaled dataframe

plt.figure(figsize=(15,10))
sns.boxplot(data=df2)
plt.xticks(rotation=45)
plt.show()    

3.4 Fixing Outliers¶

In [10]:
# print shape to check initial rows.
print(df2.shape)

# From this graph we decided to make the threshold = 2 to limit most outliers.
threshold = 2
selected_rows= (df2<threshold).all(axis=1) & (df2>-threshold).all(axis=1)
selected_index=df[~selected_rows].index
df2.drop(index=selected_index,inplace=True)
ndf=df.drop(index=selected_index)
ndf.reset_index(inplace = True, drop = True)
# Print shape to check final amount of rows.
print(df2.shape)

plt.figure(figsize=(15,15))
sns.boxplot(data=df2)
plt.xticks(rotation=45)
plt.show()

# Replacing the values from df and df2
df.drop(index=selected_index,inplace=True)
print(df.shape)
(2938, 19)
(1728, 19)
(1728, 23)

4. Model Planning¶

4.1 Statistical Summary¶

In [11]:
# Statistical summary of numeric columns.
display(df.describe().T)

# Statistical summary of categorical columns.
display(df.describe(include='object').T)
count mean std min 25% 50% 75% max
Year 1728.0 2.008086e+03 4.608478e+00 2000.00000 2004.000000 2.008000e+03 2.012000e+03 2.015000e+03
Life expectancy 1728.0 7.079758e+01 7.291746e+00 51.00000 66.300000 7.270000e+01 7.540000e+01 8.800000e+01
Adult Mortality 1728.0 1.465064e+02 9.325697e+01 1.00000 76.000000 1.410000e+02 1.990000e+02 4.120000e+02
infant deaths 1728.0 1.245428e+01 2.559142e+01 0.00000 0.000000 2.000000e+00 1.225000e+01 2.380000e+02
Alcohol 1728.0 4.439963e+00 3.629968e+00 0.01000 1.007500 4.260000e+00 7.010000e+00 1.243000e+01
percentage expenditure 1728.0 3.425669e+02 6.459938e+02 0.00000 3.984620 7.524183e+01 3.777792e+02 4.506256e+03
Hepatitis B 1728.0 8.809968e+01 1.175700e+01 36.00000 80.940461 9.300000e+01 9.700000e+01 9.900000e+01
Measles 1728.0 7.526817e+02 2.607044e+03 0.00000 0.000000 5.000000e+00 1.522500e+02 2.478900e+04
BMI 1728.0 4.112592e+01 1.886290e+01 2.00000 24.575000 4.700000e+01 5.630000e+01 7.730000e+01
under-five deaths 1728.0 1.691493e+01 3.560029e+01 0.00000 0.000000 3.000000e+00 1.500000e+01 3.310000e+02
Polio 1728.0 8.956056e+01 1.197609e+01 38.00000 85.750000 9.500000e+01 9.800000e+01 9.900000e+01
Total expenditure 1728.0 5.769482e+00 1.924121e+00 1.15000 4.530000 5.920000e+00 6.900000e+00 9.950000e+00
Diphtheria 1728.0 8.939929e+01 1.224797e+01 36.00000 86.000000 9.400000e+01 9.800000e+01 9.900000e+01
HIV/AIDS 1728.0 7.650463e-01 1.669615e+00 0.10000 0.100000 1.000000e-01 4.000000e-01 1.170000e+01
GDP 1728.0 4.740394e+03 5.812111e+03 1.68135 692.573517 3.179672e+03 5.815971e+03 3.281617e+04
Population 1728.0 8.479216e+06 1.270535e+07 123.00000 423820.750000 3.526114e+06 1.275338e+07 1.173189e+08
thinness 1-19 years 1728.0 4.058771e+00 2.914535e+00 0.10000 1.700000 3.200000e+00 6.325000e+00 1.360000e+01
thinness 5-9 years 1728.0 4.079678e+00 2.940589e+00 0.10000 1.700000 3.300000e+00 6.300000e+00 1.370000e+01
Income composition of resources 1728.0 6.711016e-01 1.338054e-01 0.28600 0.600750 6.950000e-01 7.690000e-01 9.480000e-01
Schooling 1728.0 1.229178e+01 2.558657e+00 5.30000 10.500000 1.250000e+01 1.410000e+01 1.840000e+01
count unique top freq
Country 1728 173 Iran (Islamic Republic of) 16
Status 1728 2 Developing 1466
Continent 1728 6 Africa 446

4.2 Univariate Graphs¶

4.2.1 Numeric Graphs¶

In [12]:
numeric_columns = df.select_dtypes(exclude='object').columns

_, axes = plt.subplots(4,5, figsize=(20,20))
for ind, col in enumerate(numeric_columns):
    sns.histplot(x=col,bins=10,kde=True,data=df, ax=axes.flatten()[ind])
    plt.xticks(rotation=45)

plt.show()

4.2.2 Categorical Graphs¶

In [13]:
# Excluding Country as it has a lot of unique values
cat_columns = df.select_dtypes(include='object').columns.drop('Country')

_, axes = plt.subplots(2, 1, figsize=(15,10))
for ind, col in enumerate(cat_columns):
    sns.countplot(y=col, data=df, ax=axes.flatten()[ind])
plt.show()

4.3 Bivariate Graphs¶

4.3.1 Numeric-Numeric Graphs¶

In [14]:
selected_columns = numeric_columns.drop('Life expectancy ')


_, axes = plt.subplots(4,5, figsize=(20,20))
for ind, col in enumerate(selected_columns):
    sns.scatterplot(x=col, y='Life expectancy ',  data=df, ax=axes.flatten()[ind])
    
plt.show() 
In [15]:
selected_columns = numeric_columns.drop('Polio')

_, axes = plt.subplots(4,5, figsize=(20,20))
for ind, col in enumerate(selected_columns):
    sns.scatterplot(x=col, y='Polio',  data=df, ax=axes.flatten()[ind])
    
plt.show() 

4.3.2 Categorical-Categorical Graphs¶

In [16]:
plt.figure(figsize=(15,10))
sns.countplot(y='Continent', hue='Status', data=df)
plt.show()

4.3 Advanced Plots¶

In [17]:
selected_columns = numeric_columns.drop('Life expectancy ')


_, axes = plt.subplots(7,3, figsize=(30,30))
for ind, col in enumerate(selected_columns):
    sns.scatterplot(x=col, y='Life expectancy ', hue='Continent', style='Status',
                    data=df, ax=axes.flatten()[ind])
plt.show() 

4.4 Graph Analysis Summary¶

We can observe the following from the histogram in 4.3.2:

  1. The majority of developed countries are located on the European continent.
  2. Africa is home to the most developing countries.
  3. There are essentially no developed countries in Africa, North America, South America, or Oceania.

We can see in 4.4 that:

  1. GDP and Percentage Experience have a moderately positive linear relationship. Furthermore, developed countries have a high GDP and percentage of experience.

  2. Income Composition of Resources and Schooling have a strong positive linear relationship. Furthermore, developed countries have higher education levels than developing countries.

  3. Polio and diphtheria have a strong positive linear relationship. Furthermore, industrialized countries have many fewer cases of diphtheria and polio than undeveloped countries.

  4. Diphtheria and Hepatitis B have a moderately positive linear relationship. Furthermore, developed countries have many fewer cases of Diphtheria and Polio than developing countries.

  5. Polio and Hepatitis B have a moderately positive linear relationship. Furthermore, developed countries have many fewer cases of diphtheria and polio than developing countries.

  6. Schooling and adult mortality have a moderately negative linear relationship. Furthermore, developed countries have better levels of education and lower adult mortality rates than developing countries.

  7. Life Expectancy and Schooling have a strong positive linear relationship*. Furthermore, developed countries have better levels of education and life expectancy than developing countries.

  8. Life Expectancy and Years have a strong positive linear relationship. Furthermore, developed countries have a higher Life Expectancy through time than developing countries.

  9. Life expectancy and adult mortality have a strong positive inversly linear relationship. Furthermore, adult mortality in developed countries is lower and life expectancy is higher. However, adult mortality is significantly greater in developing countries, and life expectancy is significantly lower.

  10. Life Expectancy and Alcohol have a moderately positive linear relationship. Furthermore, developed countries consume more alcohol and have longer life expectancies. However, developing countries have lower alcohol use and life expectancy than developed countries. As a result, the impact of alcohol on developed countries is reduced.

  11. Schooling and adult mortality have a moderately negative linear relationship. Furthermore, developed countries have better schooling and lower adult mortality rates than developing countries.

5. Model Building¶

In [18]:
# This function automatically finds the best method

def regressionAnalysis(input_col, output_y):
    # Checks for Multiple inputs
    if len(input_col) > 1:
        X = df.loc[:, input_col].values
        print(f'{" and ".join(input_col)} vs {output_y}')
    else:    
        X = df[input_col].values.reshape(-1, 1)  
        
    y = df[output_y].values.reshape(-1, 1)  
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    scaler = StandardScaler()
    scaler.fit(np.c_[X_train,y_train])

    A_train = scaler.transform(np.c_[X_train,y_train])
    X_train = A_train[:,:-1]
    y_train = A_train[:,-1]    
    A_test = scaler.transform(np.c_[X_test,y_test])
    X_test = A_test[:,:-1]
    y_test = A_test[:,-1]

    # OLS
    reg1 = LinearRegression(fit_intercept=False).fit(X_train, y_train)
    y_pred1 = reg1.predict(X_test)
    mse1 = round(mean_squared_error(y_test, y_pred1),5)
    print('The MSE using OLS is:', mse1)




    ## RidgeCV Analysis
    reg2 = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3], fit_intercept=False,cv=10).fit(X_train, y_train)
    y_pred2 = reg2.predict(X_test)
    mse2 = round(mean_squared_error(y_test, y_pred2),5)
    print('The MSE using Ridge is:', mse2)




    ## LassoCV Analysis
    reg3 = LassoCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3],
                fit_intercept=False,cv=10, random_state=0).fit(X_train, y_train)
    y_pred3 = reg3.predict(X_test)
    mse3 = round(mean_squared_error(y_test, y_pred3),5)
    print('The MSE using Lasso is:', mse3)
    
    # Find MSE with smallest value.
    best_mse = round(min([mse1,mse2,mse2]),5)
    # 
    if best_mse is mse1:
        best_method_name = 'Linear Regression'
        best_method = LinearRegression()
    elif best_mse is mse2:
        best_method_name = 'Ridge Regression'
        best_method = RidgeCV()
    else:
        best_method_name = 'Lasso Regression'
        best_method = LassoCV()
    
    
    print(f'The best method with smallest MSE is {best_method_name} with {best_mse}' )
    
    # Creating a plot that shows each input and output variable and the slope. 
    for ind, col in enumerate(input_col):
        plt.figure(figsize=(5,5))
        x_values = np.arange(min(df[col]), max(df[col])).reshape(-1,1)
        best_method.fit(df[col].values.reshape(-1,1), df[output_y].values.reshape(-1,1))
        sns.scatterplot(x=col, y= output_y, hue='Continent', style='Status', data=df)
        y_head = best_method.predict(x_values)
        plt.plot(x_values, y_head, color="red")
        plt.title(f'{col} vs {output_y}')
        plt.xlabel(col)
        plt.ylabel(output_y)
        plt.show()

5.1 (Polio and Hepatitis B) vs Diphtheria¶

In [19]:
regressionAnalysis(['Polio', 'Hepatitis B'], 'Diphtheria ')
Polio and Hepatitis B vs Diphtheria 
The MSE using OLS is: 0.06927
The MSE using Ridge is: 0.06975
The MSE using Lasso is: 0.06928
The best method with smallest MSE is Lasso Regression with 0.06927

5.2 (Schooling and Income Composition of Resources) vs Life Expectancy¶

In [20]:
regressionAnalysis(['Schooling', 'Income composition of resources'], 'Life expectancy ')
Schooling and Income composition of resources vs Life expectancy 
The MSE using OLS is: 0.35212
The MSE using Ridge is: 0.35216
The MSE using Lasso is: 0.35217
The best method with smallest MSE is Lasso Regression with 0.35212

5.3 Summary and Analysis¶

  • As we can see from these results:

lasso regression performs better in both graphs and OLS performs worse meaning that most of the inputs except the selected ones are unrelated to the output so we can interpret that life expectancy and income does increase with schooling the same relationship can be said for Diphtheria with Polio and Hepatitis B. Moreover, a big factor affecting the output are the continents and the status of the countries as Africa and other developing countries have less schooling compared to developed countries while with diphtheria more developed countries and continents such as Asia and Europe deal with it.

6. Operationalize¶

6.1 Methodology¶

Our business objectives is aimed for insurance companies that can use our project in predicting the life expectency of a citizen in a country. Moreover, the insurance companies can estimate the cost of the insurance & the potential return in the unfortunate case of death. Furthermore, a prediction of the probability of death based on life expectency. In addition, our project can aid in optimizing the companies profit from the insurance service based on our data analysis.

Understanding our data We have used the data from Kaggle after deeply searching for an adequate dataset that can be used in our analysis. Furthermore, the data provide a realistic aspect of life expectations with a clear structure & quality information that can further aid our project. Subsequently, the limitations of this dataset enabled us to give a highly accurate assessment of futuristic prediction of life expectency.

Our plan Initially, we decided to apply descriptive analysis & exploratary analytics on our data to identify patterns. In addition, our work flow depended on how related each variables to other variables. Moreover, how all variables contribute in predicting the life expectency.

Tools & Technology We have depended heavily on matplotlib library, Seaborn library, pandas, and numpy. In addition, we have used linear regression as a method to predict feature and current life expectency based on features. Collabrative work Each team member brings an original point of view on our data and presented his opionion. Moreover, other members have assessed and elected an elite perspective on the data.

6.2 Issues with implementing methodology¶

Issues in the implementation of the selected methodology, Many issues could arise in a methodology. Consequently, this allows for enhancing our methodology & reevaluate our work. Fortunately, such issues were minimized in our endevour. However, some problems that could happen are as the following. First, Poor approach towards the data, such as poor descriptive analysis or explortary analytics could result in a misunderstanding of data. To elaborate, the most important part of a data scientist is an understanding of data , where we illustrate and emphasise the meaning of data. Second, Difficulty in establishing and connecting with the project's business objectives and goals. Followingly, high-quality data preparation and acquisition challenges for the project. Another possible problem is using poor technique that may not yield a useful result to the purpose of the project. Moreover, the technology used should be adequate in offering a value for the companies. Other issue, integration of the operationalized system with current systems and procedures is difficult.

7. Result¶

7.1 Summary and conclusion¶

Summary and conclusion of our analysis our implicit usage of the regression analysis have yielded several models which can predict certain features based on data analysis. (Polio and Hepatitis B) vs Diphtheria Initially, we have used regression technique to anticipate the countries Diphtheria vaccine on its population, diffrentiated using the continents to establish how geograph, GDP and other labels contribute on vaccine intake. Then, our regrission model have depended on the labels that took the vaccines for the Polio and Hepatitis B, the continent, and the status of the country. Followingly, we have found that both have a deep linear dependency which enables our mechanistic analytics to yield a realistic result. In elaboration, Most european contries are developed and a large proportion are predicted to take all three vaccines. However, Asian contries differ in thier status with the conclusion that developed asian countries are more likely to take the Polio, Hepatitis B, & the Diphtheria vaccines. In addition, African contries are more less likely to take the Polio vaccine, but they are likely to take the Hepatitis B vaccine, and appears that they are less likely to take the Diptheria vaccine based on our analysis. Furthermore, North american & south american countries are showing significant potential in being vaccined by the three vaccines.

Life expectancy vs (Income composition of resources, Schooling)

Another major aspect of our project is the regression model for the life expectency which is the main factor in predicting the cost of insurance. Additionally, the module can optimize profit for life insurance companies and aid in estimating the expected worth of life to pay in case of death based on the data provided. To conclude our finding regarding Life expectency, We have picked Schooling & Income composition of resources because they had the highest correlation among the variables. In addition, the best MSE was achieved in all 3 types. Moreover, Schooling or education is poor in africa due to most african countries being a developing country, and this resulted in a clear lower life expectency relative to other countries which we will discuss. In addition , Life expectency in developed and developing european countries are siginifcant. Hence, most european countries are predicted to have a high life expectency in future years. Moreover, most asian countries, south, and north america have moderate schooling which predicts a lower life expectency than european countries, but higher than african countries. Finally, in general developed countries are more likely to have a higher life expectence prediction in future years.

In conclusion, our regression model shows success in predicting both the life expectency & how much the population is willing to take the Dipththeria vaccine. Moreover, European developed and developing countries have a predicted low risk of dying, a high life expectancy with a population tendecy in taking the vaccine, and high schooling rates. North american & south american countries risk of dying is moderate, with a moderate life expectancy and also shows a moderate enforcing of the Diptheria vaccine to the population. Moreover, African countries schooling rates are much lower than relative countires, also most african countries are developing. Hence, african countries population have a high risk of dying, and a much lower life expectancy with a predicted take of diptheria vaccine if they take the polio vaccine. Oceania developed countries shows moderate life expectancy, and low acceptace to both vaccines and a low predicted take of the diptheria vaccine.


7.2 Future recommendation¶

Give possible future recommendations.

Based on our data & project's data analysis, our recommendation for life insurance companies is to offer lower costs for european citizens due to thier high life expectance. and make a moderate cost for North american countries with a similar cost for south american countries. In addition, Oceania's developing countries should have higher costs than Oceana's developed countries due to the varation of life expectancy. However, african countries population should have a massive life insurance cost due to thier extremly low life expecations relative to other countries. Hence Using This analysis insurace companies should be able to optimize thier profit and minimize the lost revenue.